12 research outputs found

    A comparison of cache hierarchies for SMT processors

    In the multithread and multicore era, programs are forced to share part of the processor structures. On one hand, the state of the art in multithreading describes how to efficiently manage and distribute inner resources such as the reorder buffer or issue windows. On the other hand, there is a substantial body of work focused on outer resources, mainly on how to effectively share last-level caches in multicores. Between these two ends, first- and second-level caches have remained apart, even though they are shared in most commercial multithreaded processors. This work analyzes multiprogrammed workloads as the worst-case scenario for cache sharing among threads. To obtain representative results, we present a sampling-based methodology that, for multiple metrics such as STP, ANTT, IPC throughput, or fairness, reduces simulation time by up to 4 orders of magnitude when running 8-thread workloads, with an error lower than 3% at a 97% confidence level. With this methodology, we compare several state-of-the-art cache hierarchies and observe that Light NUCA provides performance benefits in SMT processors regardless of the organization of the last-level cache. Most importantly, Light NUCA's gains are consistent across the entire range of simulated thread counts, from one to eight.
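
    The abstract does not spell out the metrics, so as a reference point, here is a minimal sketch of the standard definitions of STP and ANTT (due to Eyerman and Eeckhout), computed from each thread's IPC when running alone and within the multiprogram mix:

    ```python
    def stp(ipc_single, ipc_multi):
        """System throughput: sum of per-thread speedups relative to
        running alone. Higher is better; n threads can reach at most n."""
        return sum(mt / st for st, mt in zip(ipc_single, ipc_multi))

    def antt(ipc_single, ipc_multi):
        """Average normalized turnaround time: mean per-thread slowdown
        caused by co-scheduling. Lower is better; 1.0 means no slowdown."""
        n = len(ipc_single)
        return sum(st / mt for st, mt in zip(ipc_single, ipc_multi)) / n

    # Two threads at 1.0 IPC alone drop to 0.6 and 0.8 when co-scheduled:
    print(stp([1.0, 1.0], [0.6, 0.8]))   # 1.4 (ideal would be 2.0)
    print(antt([1.0, 1.0], [0.6, 0.8]))  # ~1.46 (average 1.46x slowdown)
    ```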

    Gestión de contenidos en caches operando a bajo voltaje

    The energy efficiency of on-chip caches can be improved by reducing their supply voltage (Vdd). However, this Vdd scaling is limited to a voltage Vddmin below which some SRAM (Static Random Access Memory) cells may not operate reliably. Block disabling (BD) is a microarchitectural technique that enables operation at very low voltages by deactivating those entries that contain any cell that does not operate reliably, at the cost of reducing the effective cache capacity. It is used in last-level caches (LLC), where the potential savings are largest. However, for some applications, the increase in consumption due to off-chip memory accesses does not compensate for the energy savings achieved in the LLC. This work leverages resources already present in multiprocessors, such as the on-chip memory hierarchy and the coherence mechanism, to improve the performance of BD. Specifically, we propose to exploit the natural redundancy of an inclusive cache hierarchy to mitigate the performance loss caused by the reduction in LLC capacity. We also propose a new content management policy that is aware of the existence of faulty cache entries. Using reuse information, the replacement algorithm assigns operational cache entries to the blocks most likely to be referenced. The proposed techniques reduce MPKI by up to 36.4% with respect to block disabling, improving its performance by 2% to 13%.
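
    For reference, the MPKI figure quoted above normalizes last-level cache misses to executed instructions:

    ```latex
    \mathrm{MPKI} = 1000 \cdot \frac{\text{LLC misses}}{\text{retired instructions}}
    ```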

    A fault-tolerant last level cache for CMPs operating at ultra-low voltage

    Voltage scaling to values near the threshold voltage is a promising technique to hold off the many-core power wall. However, as voltage decreases, some SRAM cells are unable to operate reliably and show a behavior consistent with a hard fault. Block disabling is a micro-architectural technique that allows low-voltage operation by deactivating faulty cache entries, at the expense of reducing the effective cache capacity. In the case of the last-level cache, this capacity reduction leads to an increase in off-chip memory accesses, diminishing the overall energy benefit of reducing the voltage supply. In this work, we exploit the reuse locality and the intrinsic redundancy of multi-level inclusive hierarchies to enhance the performance of block disabling with negligible cost. The proposed fault-aware last-level cache management policy maps critical blocks, those not present in private caches and with a higher probability of being reused, to active cache entries. Our evaluation shows that this fault-aware management results in up to 37.3% and 54.2% fewer misses per kilo-instruction (MPKI) than block disabling for multiprogrammed and parallel workloads, respectively. This translates to performance enhancements of up to 13% and 34.6% for multiprogrammed and parallel workloads, respectively.
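
    A minimal sketch of the policy's core idea follows; the names and the criticality test are illustrative only (the actual policy derives reuse and private-cache presence from replacement state and coherence information):

    ```python
    # Illustrative sketch of fault-aware, reuse-based LLC victim selection.
    # Hypothetical structure, not the paper's implementation.
    from dataclasses import dataclass

    @dataclass
    class Way:
        faulty: bool        # entry disabled at low Vdd (unreliable cell)
        reused: bool        # block has shown reuse since insertion
        in_private: bool    # copy also lives in a private (L1/L2) cache

    def choose_victim(set_ways, incoming_reused):
        """Pick a victim way for an incoming block.

        Critical blocks (reused, no private-cache copy to fall back on)
        go to operational ways; non-critical blocks may land in faulty
        ways, where they are effectively dropped at low voltage."""
        if not incoming_reused:
            # Low-value block: prefer a faulty way so it does not displace
            # a critical block from the limited operational capacity.
            for w in set_ways:
                if w.faulty:
                    return w
        # Otherwise evict, in order of preference: non-reused blocks first,
        # then blocks still backed by a private-cache copy (the inclusive
        # hierarchy's redundancy covers them), then any operational way.
        candidates = [w for w in set_ways if not w.faulty]
        candidates.sort(key=lambda w: (w.reused, not w.in_private))
        return candidates[0] if candidates else set_ways[0]
    ```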

    DynAMO: Improving parallelism through dynamic placement of atomic memory operations

    With increasing core counts in modern multi-core designs, the overhead of synchronization jeopardizes the scalability and efficiency of parallel applications. To mitigate these overheads, modern cache-coherent protocols offer support for Atomic Memory Operations (AMOs) that can be executed near-core (near) or remotely in the on-chip memory hierarchy (far). This paper evaluates currently available static AMO execution policies implemented in multi-core System-on-Chip (SoC) designs, which select an AMO's execution placement (near or far) based on the cache block's coherence state. We propose three static policies and show that the performance of static policies is application dependent. Moreover, we show that one of our proposed static policies outperforms currently available implementations. Furthermore, we propose DynAMO, a predictor that selects the best location to execute AMOs. DynAMO identifies the different locality patterns to make informed decisions, improving AMO latency and increasing overall throughput. DynAMO outperforms the best-performing static policy and provides geometric mean speed-ups of 1.09× across all workloads and 1.31× on AMO-intensive applications with respect to executing all AMOs near.

    This research was supported by the Spanish Ministry of Science and Innovation (MCIN) through contracts [PID2019-107255GB-C21], [TED2021-132634A-I00], and [PID2019-105660RB-C21]; the Generalitat de Catalunya through contract [2021-SGR-00763]; the Government of Aragon [T5820R]; the Arm-BSC Center of Excellence; and the European Processor Initiative (EPI), part of the European Union's Horizon 2020 research and innovation program under grant agreement No. 826647. V. Soria-Pardos has been supported through an FPU fellowship [FPU20-02132]; A. Armejach is a Serra Hunter Fellow and has been partially supported by Grant [IJCI-2017-33945] funded by MCIN/AEI/10.13039/501100011033; M. Moreto through a Ramón y Cajal fellowship [RYC-2016-21104].
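
    As a rough illustration of dynamic placement (a hypothetical predictor, not DynAMO's actual design), a table of saturating counters can track whether consecutive AMOs to a cache line come from the same core (locality, favoring near execution) or from different cores (contention, favoring far execution at the shared cache):

    ```python
    # Hypothetical near/far AMO placement predictor: 2-bit saturating
    # counters indexed by cache-line address, trained on the requester.
    class AMOPlacementPredictor:
        def __init__(self, table_size=1024):
            self.counter = [2] * table_size    # 0..3; >=2 predicts "near"
            self.last_core = [None] * table_size

        def _index(self, line_addr):
            return line_addr % len(self.counter)

        def predict(self, line_addr):
            """Return 'near' or 'far' for the next AMO to this line."""
            return 'near' if self.counter[self._index(line_addr)] >= 2 else 'far'

        def update(self, line_addr, core_id):
            """Train on each observed AMO requester."""
            i = self._index(line_addr)
            if self.last_core[i] == core_id:
                self.counter[i] = min(3, self.counter[i] + 1)  # same core: near
            else:
                self.counter[i] = max(0, self.counter[i] - 1)  # sharing: far
            self.last_core[i] = core_id
    ```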

    An adaptive controller to save dynamic energy in LP-NUCA

    Portable devices often demand powerful processors to run computing-intensive applications, such as video playback or gaming, and ultra-low energy consumption to extend device uptime. Such conflicting requirements are hard to fulfil and call for adaptive hardware that only consumes energy when required. LP-NUCA is a tiled cache organization aimed at high-performance low-power embedded processors that sequentially looks up blocks, ordered by temporal locality, in groups of small tiles. Unfortunately, LP-NUCA has two main sources of wasted dynamic energy: (a) blocks continuously migrate among tiles even in low-locality phases, and (b) to reduce cache latency, the tag and data arrays of the tiles are always accessed in parallel. This paper proposes a learning-based controller that dynamically tunes block migration and switches the cache access policy between parallel and serial. During low temporal locality phases, the controller drops blocks from the LP-NUCA root tile, L1, and forces a serial access to the tag and data arrays in the tiles, thus reducing the energy waste. Using a cycle-accurate simulator and energy estimations derived from an LP-NUCA layout, the proposed controller reduces dynamic energy by 20% on average for single- and multi-thread workloads.
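
    A minimal sketch of such a controller, assuming a simple interval-based hit-rate monitor (the epoch length, threshold, and structure are illustrative; the paper's learning-based controller is more elaborate):

    ```python
    # Illustrative adaptive energy controller for a tiled cache: a hit-rate
    # monitor over fixed access intervals toggles two knobs, block
    # migration into the root tile and parallel vs. serial tag/data access.
    class LPNUCAController:
        INTERVAL = 4096          # accesses per decision epoch (assumed)
        LOW_LOCALITY = 0.5       # hit-rate threshold (assumed)

        def __init__(self):
            self.hits = 0
            self.accesses = 0
            self.migrate = True      # fill root tile on hits in lower levels
            self.parallel = True     # read tag and data arrays together

        def on_access(self, hit):
            self.hits += hit
            self.accesses += 1
            if self.accesses == self.INTERVAL:
                self._decide()

        def _decide(self):
            # Low temporal locality: migration and parallel reads rarely
            # pay off, so drop blocks from the root tile and serialize
            # tag/data access; otherwise restore the fast configuration.
            low = self.hits / self.accesses < self.LOW_LOCALITY
            self.migrate = not low
            self.parallel = not low
            self.hits = self.accesses = 0
    ```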

    On the use of many-core Marvell ThunderX2 processor for HPC workloads

    Marvell’s ThunderX2 has been the first Arm-based processor deployed in large-scale HPC production systems, challenging the dominance that x86 processors have held over the last decades. While x86 processors and their software stack have been characterized in detail, the behavior of their Arm counterparts is not well known, limiting their adoption. This work methodically characterizes the performance and power efficiency of the ThunderX2 running different HPC workloads compiled with two state-of-the-art compilers, GCC and the Arm HPC Compiler. We study the maturity of the available compilers and find that the Arm HPC Compiler is able to apply additional optimizations, resulting in better performance than GCC. We also compare both performance and power with respect to an Intel Skylake processor. Despite the faster single-thread performance of Skylake, ThunderX2 is able to match its performance on multi-threaded workloads thanks to its superior memory bandwidth. However, the power efficiency of ThunderX2 is far from matching Skylake-based processors when AVX512 extensions are used.
